Recently I had a shower thought: what if I could encode shellcode into audio signals via sine waves, paired with a neural network for reliable decoding under noisy conditions? This is typical of my shower thoughts, if not one of the more random ones.
So let's begin. First, the concept: convert binary shellcode into frequency-mapped sine waves stored in a .wav file. This acts as an obscurity layer; the file appears as benign audio, potentially evading signature-based detection. On the target, a custom dropper decodes it back to executable code. It's not foolproof, but it explores how multimedia formats can hide payloads. Finding vulnerabilities wasn't the goal; it was about prototyping and learning. I mostly develop on macOS, but this works on Windows and Linux too, which I've tested as well.
In cybersecurity, obscurity techniques like this draw from steganography, where data is hidden in plain sight within non-suspicious carriers. Audio files are particularly appealing because they're ubiquitous—think email attachments, shared media, or even streaming services. Unlike traditional encoding (e.g., base64 in scripts), this method leverages digital signal processing (DSP) to embed bytes in audible tones, making it harder for casual inspection. If the audio plays without raising alarms, and only a specific decoder can extract the payload, it adds a layer of deniability. Of course, advanced forensics could spot it, but that's part of the experiment.
Background on Frequency Mapping
The idea builds on frequency-shift keying (FSK), a modulation technique used in early modems and radio communications. In FSK, data bits shift between discrete frequencies to represent 0s and 1s. Here, I extended it to bytes: each 8-bit value corresponds to a unique frequency in a continuous range. Why sine waves? They're simple to generate mathematically and represent pure tones, which can be analyzed via Fourier transforms for decoding. Adding noise mimics real audio environments, like speaker distortion or background interference, forcing the decoder to be resilient.
I chose a 300-3000 Hz range to stay within human hearing but avoid very low frequencies that might clip on cheap hardware. For stealthier variants, shifting to ultrasonics (above 20 kHz) could allow silent transmission—devices like laptops often handle these frequencies without users noticing.
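To make the mapping concrete: byte 0x41 (decimal 65) lands at 300 + (65 / 255) * (3000 - 300) ≈ 988 Hz, while 0xFF pegs the top of the range at exactly 3000 Hz.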
Step 1: The Encoding Mechanism
At the core is a simple FSK-inspired encoding. Each byte (0-255) maps linearly to a frequency from 300 Hz to 3000 Hz, generating a 50 ms sine wave segment at that frequency. I added Gaussian noise (standard deviation 0.01) to simulate real-world audio imperfections and make it less detectable as structured data. The sample rate is 8000 Hz, common for low-fidelity signals.
Here's the key function:
import numpy as np
SAMPLE_RATE = 8000
DURATION = 0.05
N_SAMPLES = int(SAMPLE_RATE * DURATION)
def encode_byte_to_wave(byte_val):
    t = np.linspace(0, DURATION, N_SAMPLES, endpoint=False)
    freq = 300 + (byte_val / 255.0) * (3000 - 300)  # Linear mapping to 300–3000 Hz
    wave = 0.5 * np.sin(2 * np.pi * freq * t)  # Amplitude 0.5 to prevent clipping
    noise = np.random.normal(0, 0.01, N_SAMPLES)
    return (wave + noise).astype(np.float32)
For a full sequence, I concatenate these segments and write to a .wav file using SciPy. This range (300-3000 Hz) stays audible but could shift to ultrasonics (e.g., 18-22 kHz) for inaudible transmission via speakers and microphones. The short duration per byte keeps files compact—a 64-byte payload is just over 3 seconds long. I tested various durations; shorter ones (e.g., 20 ms) increased error rates in noisy channels, while longer ones made the audio suspiciously repetitive.
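For completeness, here's a minimal sketch of that concatenate-and-write step, reusing encode_byte_to_wave and SAMPLE_RATE from above; the encode_payload name and the NOP-filled example payload are mine, not from the original script:

from scipy.io import wavfile

def encode_payload(payload: bytes, out_path: str):
    # One 50 ms tone per byte, concatenated into a single signal
    signal = np.concatenate([encode_byte_to_wave(b) for b in payload])
    wavfile.write(out_path, SAMPLE_RATE, signal)  # float32 data is written as an IEEE-float WAV

encode_payload(b"\x90" * 64, "payload.wav")  # 64 NOPs -> a ~3.2 s file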
Step 2: Building the Training Dataset
To train a neural network for decoding, which is essential for handling distortions like compression or ambient noise, I generated a synthetic dataset: 50,000 samples, each a random 64-byte array encoded as audio, with a matching .npy file for ground truth. Why 64 bytes? It's a common shellcode size for proofs-of-concept, like basic reverse shells or droppers.
The generation script uses tqdm for progress tracking, and the process is light enough to run on modest hardware. Random bytes ensure diversity, covering the full frequency range.
This produces ~3.2-second .wav files per sample (64 bytes * 0.05 s). The random bytes mimic variable shellcode patterns, training the model on diverse frequencies. To augment the data, I could have added variations like reverb or MP3 compression, but for the initial phase, noise was sufficient.
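For reference, the generation loop is roughly the following; the encode_payload helper comes from the sketch in Step 1, and the file layout (zero-padded index names in a flat directory) is my assumption, not necessarily the original script's:

import os
from tqdm import tqdm

def generate_dataset(n_samples=50_000, out_dir="dataset"):
    os.makedirs(out_dir, exist_ok=True)
    for i in tqdm(range(n_samples), desc="encoding"):
        payload = np.random.randint(0, 256, size=64, dtype=np.uint8)
        encode_payload(payload.tobytes(), os.path.join(out_dir, f"{i:05d}.wav"))
        np.save(os.path.join(out_dir, f"{i:05d}.npy"), payload)  # ground-truth bytes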
Step 3: Training the Neural Network
For decoding, I used a convolutional neural network (CNN) in PyTorch, processing spectrograms of the audio (via Short-Time Fourier Transform) to predict byte sequences. Why CNN? They excel at feature extraction from grid-like data, like spectrograms, where time and frequency axes reveal patterns.
Input: mel-spectrograms (shape [batch, channels=1, freq, time]); output: 64-byte vectors. The mel scale mimics human hearing, improving robustness to pitch variations.
I preprocessed .wav to spectrograms, normalized, and trained with MSE loss. Batch size 32, Adam optimizer (lr=0.001), 10 epochs on a GPU. Validation split (20%) showed low error rates (~0.5% bit flips under moderate noise).
A simplified model snippet:
import numpy as np
import torch
import torch.nn as nn
import torchaudio

# Fixed spectrogram dimensions: every clip is 64 bytes * 0.05 s at 8 kHz,
# and with torchaudio's default n_fft=400 / hop_length=200 / n_mels=128
# that works out to 128 mel bins by 129 time frames.
SPEC_HEIGHT = 128
SPEC_WIDTH = 129

class AudioDecoder(nn.Module):
    def __init__(self):
        super(AudioDecoder, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.fc = nn.Sequential(
            nn.Linear(64 * (SPEC_HEIGHT // 4) * (SPEC_WIDTH // 4), 512),
            nn.ReLU(),
            nn.Linear(512, 64)  # Output 64 bytes
        )

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        return torch.sigmoid(self.fc(x)) * 255  # Scale to 0-255

# Training loop example (batching omitted for brevity);
# dataset_pairs is the list of (wav_path, npy_path) tuples from Step 2
model = AudioDecoder()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()
mel = torchaudio.transforms.MelSpectrogram(sample_rate=8000)  # match the encoder's sample rate

for epoch in range(10):
    for wav_path, npy_path in dataset_pairs:
        audio, _ = torchaudio.load(wav_path)   # shape [1, n_samples]
        spec = mel(audio)                      # shape [1, n_mels, time]
        labels = torch.from_numpy(np.load(npy_path)).float()  # keep the 0-255 scale to match the model output
        optimizer.zero_grad()
        outputs = model(spec.unsqueeze(0))     # add batch dim -> [1, 1, n_mels, time]
        loss = criterion(outputs, labels.unsqueeze(0))
        loss.backward()
        optimizer.step()
BatchNorm stabilized training, preventing gradient issues. The model achieved 95% accuracy on clean data, dropping to 85% with added distortions—good enough for PoC. During training, I monitored for overfitting using early stopping, but the large dataset helped. Alternatives like RNNs (e.g., LSTM) were considered for sequential dependencies, but CNNs performed better on spectrograms.
For comparison, a non-ML baseline using Fast Fourier Transform (FFT) decoded clean audio perfectly but failed under 10% noise—highlighting the NN's advantage.
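For the curious, that baseline is only a few lines. One subtlety: with 400-sample segments the native FFT bin width is 8000 / 400 = 20 Hz, coarser than the ~10.6 Hz spacing between adjacent byte frequencies, so this sketch zero-pads the FFT to sharpen the peak estimate. It reuses SAMPLE_RATE and N_SAMPLES from Step 1 and is my reconstruction, not the original baseline script:

def fft_decode(signal, n_bytes=64):
    decoded = []
    for i in range(n_bytes):
        seg = signal[i * N_SAMPLES:(i + 1) * N_SAMPLES]
        # Zero-pad to 8192 points so peak resolution (~1 Hz) beats the byte spacing
        spectrum = np.abs(np.fft.rfft(seg, n=8192))
        freqs = np.fft.rfftfreq(8192, d=1.0 / SAMPLE_RATE)
        peak = freqs[np.argmax(spectrum)]  # dominant tone in this segment
        byte_val = round((peak - 300) / (3000 - 300) * 255)  # invert the linear map
        decoded.append(min(max(byte_val, 0), 255))
    return bytes(decoded)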
Step 4: Test Encoding and Dropper
A separate script handles test cases, like embedding "Hello World!" shellcode. This simulates a real payload: assembly for writing to stdout, followed by the string and NOP padding.
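In the same spirit, a sketch of the framing; the assembled shellcode bytes themselves are deliberately left out, so the placeholder below is hypothetical:

def pad_payload(code: bytes, size: int = 64) -> bytes:
    assert len(code) <= size, "payload exceeds the model's fixed 64-byte width"
    return code + b"\x90" * (size - len(code))  # pad with x86 NOPs (0x90)

payload = pad_payload(b"...assembled write-to-stdout stub...")  # hypothetical placeholder
encode_payload(payload, "hello.wav")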
The dropper uses the trained model for baseline decoding, executing the output if valid. In tests, I played the .wav through speakers and decoded it with a success rate of ~90%. I built a dropper targeting Windows that would open a thread and then crash the program; it was actually fun to see it work. One caveat: 64 bytes is small, and the model isn't perfect. If the shellcode isn't exactly 64 bytes, the model will fail to decode it correctly. An attacker who wanted to use this would need a larger dataset, a more complex model, and support for longer payloads.
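Here's the decode side, minus the execution step, which I'm leaving out on purpose: load the .wav, run the model, round back to bytes, and verify against the known payload. The mel transform and model are the ones from Step 3:

def decode_wav(path, model):
    audio, _ = torchaudio.load(path)
    spec = mel(audio)  # same MelSpectrogram settings as training
    with torch.no_grad():
        out = model(spec.unsqueeze(0))  # [1, 64], values in 0-255
    return bytes(out.round().clamp(0, 255).byte().squeeze(0).tolist())

decoded = decode_wav("hello.wav", model)
print(decoded == payload)  # sanity-check before trusting the output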
Implications for Security
This highlights audio as a stealth vector: .wav files bypass many scanners since they're not executables. In air-gapped setups, playback via speakers and microphone capture enables covert delivery. Social engineering could involve "funny sound clips" in emails. I tested an encoded .wav on VirusTotal, and of course it was not detected.
However, implications cut both ways. Offensively, it evades static analysis; defensively, it underscores needs for behavioral monitoring or spectral anomaly detection. Tools like ML-based stego detectors could counter it. Ethically, this is red-team research—useful for testing defenses, but risky if misused.
Broader context: Similar techniques appear in academic papers on acoustic covert channels (e.g., 2010s research on ultrasonic data transfer). Malware like 2019's audio-stego campaigns embedded miners in WAVs, but using sine waves and NN decoding adds a modern twist.
Lessons Learned
Noise addition improves realism but complicates decoding; NN handles it better than rigid thresholds.
macOS permissions limit raw execution—need entitlements for droppers.
Dataset size matters; 50k samples prevented overfitting.
VMs are finicky; direct hardware tests next, with safeguards.
Frequency collisions: Linear mapping works, but a non-linear (e.g., logarithmic) mapping might reduce errors for high byte values; a quick sketch follows this list.
Performance: Encoding is fast (~ms per file); decoding with NN takes seconds on CPU, faster on GPU.
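To illustrate that logarithmic-mapping idea, here is a minimal sketch of what a geometric spacing could look like; I didn't implement this, so treat it as a starting point rather than tested code. Adjacent bytes differ by a constant frequency ratio instead of a constant ~10.6 Hz step:

import math

def byte_to_freq_log(byte_val, f_lo=300.0, f_hi=3000.0):
    # Geometric spacing: each byte step multiplies the frequency by a constant ratio
    return f_lo * (f_hi / f_lo) ** (byte_val / 255.0)

def freq_to_byte_log(freq, f_lo=300.0, f_hi=3000.0):
    # Inverse mapping for the decoder side
    return round(255.0 * math.log(freq / f_lo) / math.log(f_hi / f_lo))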
Future Plans
Honestly, I don't plan to release any of the code beyond what's written here, nor the dataset or any of the shellcode samples. If this approach were refined a bit more, it could be weaponized for malicious purposes, and I don't want that.
Final Thoughts
This project isn't revolutionary, but it demonstrates practical obscurity in a fun way. Encoding shellcode as sine waves and using a neural network for decoding was a solid learning exercise in DSP and ML, bridging theory with code. No major breakthroughs, but it's satisfying to see a .wav "hide" executable logic. If security tinkering appeals to you, try adapting it, safely.